real-world scene


A Multi-World Approach to Question Answering about Real-World Scenes based on Uncertain Input

Neural Information Processing Systems

We propose a method for automatically answering questions about images by bringing together recent advances from natural language processing and computer vision. We combine discrete reasoning with uncertain predictions using a multi-world approach that represents uncertainty about the perceived world in a Bayesian framework. Our approach can handle human questions of high complexity about realistic scenes and replies with a range of answer types, such as counts, object classes, instances, and lists of them. The system is trained directly from question-answer pairs. We establish a first benchmark for this task that can be seen as a modern attempt at a visual Turing test.
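
To make the multi-world marginalization concrete, the sketch below averages a toy counting answer over worlds sampled from detection confidences. The sampling scheme, the answer_in_world stand-in for the semantic parser, and all names are our illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch of the multi-world idea: the answer distribution is
# averaged over several "worlds" (scene interpretations) sampled from an
# uncertain perception model. Names and the toy executor are illustrative.
from collections import Counter
import random

def sample_world(segmentation_scores, rng):
    """Sample one world: each detection keeps its predicted label with the
    probability assigned by the perception model."""
    return {obj: label for obj, (label, p) in segmentation_scores.items()
            if rng.random() < p}

def answer_in_world(question, world):
    """Placeholder symbolic executor: the 'question' is simply a label to
    count, standing in for a full semantic parse of the question."""
    return sum(1 for label in world.values() if label == question)

def multi_world_answer(question, segmentation_scores, num_worlds=100, seed=0):
    """Marginalize the answer over sampled worlds:
    P(A | Q, I) ~= (1/S) * sum_s P(A | Q, W_s), with W_s ~ P(W | I)."""
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(num_worlds):
        world = sample_world(segmentation_scores, rng)
        votes[answer_in_world(question, world)] += 1
    total = sum(votes.values())
    return {answer: count / total for answer, count in votes.items()}

# Example: two chair detections and one table detection with confidences.
scores = {"det1": ("chair", 0.9), "det2": ("chair", 0.6), "det3": ("table", 0.8)}
print(multi_world_answer("chair", scores))  # distribution over counts 0, 1, 2
```

With enough sampled worlds, the returned distribution reflects the perception model's uncertainty rather than committing to a single hard segmentation.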


Teaching Vision-Language Models to Ask: Resolving Ambiguity in Visual Questions

Pu Jian, Donglei Yu, Wen Yang, Shuo Ren, Jiajun Zhang

arXiv.org Artificial Intelligence

In the context of visual question answering (VQA), users often pose ambiguous questions to vision-language models (VLMs) due to varying expression habits. Existing research addresses such ambiguities primarily by rephrasing questions. These approaches neglect the inherently interactive nature of user interactions with VLMs, where ambiguities can be clarified through user feedback. However, research on interactive clarification faces two major challenges: (1) no benchmark exists to assess VLMs' capacity for resolving ambiguities through interaction; (2) VLMs are trained to prefer answering over asking, which prevents them from seeking clarification. To overcome these challenges, we introduce the ClearVQA benchmark, which targets three common categories of ambiguity in the VQA context and encompasses various VQA scenarios.


Supplementary Material: Neural Transmitted Radiance Fields

Neural Information Processing Systems

The transmission likelihood and reflection likelihood are displayed with scaled color. The recurring edges in our framework act as a pilot to guide the training. This is based on a simple observation verified by a number of pioneering reflection removal methods [1, 2]: the transmission repeatedly appears across a set of images, while the reflection is only sparsely present. Besides the examples shown in Figure 1 of our main paper, we show an additional example in this supplementary material.
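
The recurrence observation can be illustrated with a short sketch: given roughly aligned images, count how often an edge appears at each pixel and treat high recurrence as evidence for transmission. This is our simplified illustration under that assumption, not the pipeline used in the paper.

```python
# Illustrative sketch (not the paper's implementation): edges that reappear in
# many aligned images are treated as likely transmission; edges that appear
# rarely are treated as likely reflection.
import numpy as np

def edge_map(img, thresh=0.1):
    """Simple gradient-magnitude edge map for a grayscale image in [0, 1]."""
    gy, gx = np.gradient(img.astype(np.float64))
    return (np.hypot(gx, gy) > thresh).astype(np.float64)

def transmission_likelihood(aligned_images, thresh=0.1):
    """Fraction of images in which an edge is present at each pixel.
    High recurrence -> likely transmission; low recurrence -> likely reflection."""
    edges = np.stack([edge_map(im, thresh) for im in aligned_images], axis=0)
    return edges.mean(axis=0)

# Toy usage with random images standing in for an aligned capture set.
rng = np.random.default_rng(0)
images = [rng.random((64, 64)) for _ in range(5)]
likelihood = transmission_likelihood(images)
print(likelihood.shape, likelihood.min(), likelihood.max())
```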


From an Image to a Scene: Learning to Imagine the World from a Million 360 Videos

Matthew Wallingford, Anand Bhattad, Aditya Kusupati, Vivek Ramanujan, Matt Deitke, Sham Kakade, Aniruddha Kembhavi, Roozbeh Mottaghi, Wei-Chiu Ma, Ali Farhadi

arXiv.org Artificial Intelligence

Three-dimensional (3D) understanding of objects and scenes plays a key role in humans' ability to interact with the world and has been an active area of research in computer vision, graphics, and robotics. Large-scale synthetic and object-centric 3D datasets have been shown to be effective in training models with 3D understanding of objects. However, applying a similar approach to real-world objects and scenes is difficult due to a lack of large-scale data. Videos are a potential source of real-world 3D data, but finding diverse yet corresponding views of the same content has proven difficult at scale. Furthermore, standard videos come with fixed viewpoints, determined at the time of capture. This restricts the ability to access scenes from a variety of more diverse and potentially useful perspectives. We argue that large-scale 360 videos can address these limitations by providing scalable corresponding frames from diverse views. In this paper, we introduce 360-1M, a 360 video dataset, and a process for efficiently finding corresponding frames from diverse viewpoints at scale. We train our diffusion-based model, Odin, on 360-1M. Empowered by the largest real-world multi-view dataset to date, Odin is able to freely generate novel views of real-world scenes. Unlike previous methods, Odin can move the camera through the environment, enabling the model to infer the geometry and layout of the scene. Additionally, we show improved performance on standard novel view synthesis and 3D reconstruction benchmarks.
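
One ingredient implied by this setup is extracting ordinary perspective views from equirectangular 360 frames so that diverse viewpoints can be mined from the same video. The sketch below shows a generic equirectangular-to-pinhole projection under assumed conventions (y down, z forward, nearest-neighbor sampling); it is not the 360-1M pipeline itself.

```python
# Hedged sketch: sample a pinhole view from an equirectangular 360 frame for a
# chosen field of view, yaw, and pitch. Conventions are our assumptions.
import numpy as np

def equirect_to_perspective(equi, fov_deg=90.0, yaw_deg=0.0, pitch_deg=0.0,
                            out_hw=(256, 256)):
    """Sample a perspective view (nearest neighbor) from an equirectangular image."""
    H, W = equi.shape[:2]
    out_h, out_w = out_hw
    f = 0.5 * out_w / np.tan(np.radians(fov_deg) / 2.0)

    # Ray directions in the virtual camera frame (x right, y down, z forward).
    xs = (np.arange(out_w) - out_w / 2.0 + 0.5) / f
    ys = (np.arange(out_h) - out_h / 2.0 + 0.5) / f
    xv, yv = np.meshgrid(xs, ys)
    dirs = np.stack([xv, yv, np.ones_like(xv)], axis=-1)
    dirs /= np.linalg.norm(dirs, axis=-1, keepdims=True)

    # Rotate rays by pitch (around x) then yaw (around y).
    p, y = np.radians(pitch_deg), np.radians(yaw_deg)
    Rx = np.array([[1, 0, 0], [0, np.cos(p), -np.sin(p)], [0, np.sin(p), np.cos(p)]])
    Ry = np.array([[np.cos(y), 0, np.sin(y)], [0, 1, 0], [-np.sin(y), 0, np.cos(y)]])
    d = dirs @ (Ry @ Rx).T

    # Longitude/latitude -> equirectangular pixel coordinates.
    lon = np.arctan2(d[..., 0], d[..., 2])          # [-pi, pi]
    lat = np.arcsin(np.clip(d[..., 1], -1.0, 1.0))  # [-pi/2, pi/2]
    u = ((lon / np.pi + 1.0) * 0.5 * (W - 1)).round().astype(int)
    v = ((lat / (np.pi / 2) + 1.0) * 0.5 * (H - 1)).round().astype(int)
    return equi[np.clip(v, 0, H - 1), np.clip(u, 0, W - 1)]

# Toy usage with a random "panorama".
pano = np.random.default_rng(0).random((512, 1024, 3))
view = equirect_to_perspective(pano, fov_deg=90, yaw_deg=30, pitch_deg=-10)
print(view.shape)  # (256, 256, 3)
```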


CRiM-GS: Continuous Rigid Motion-Aware Gaussian Splatting from Motion Blur Images

Junghe Lee, Donghyeong Kim, Dogyoon Lee, Suhwan Cho, Sangyoun Lee

arXiv.org Artificial Intelligence

Neural radiance fields (NeRFs) have received significant attention due to their high-quality novel view rendering, prompting research that addresses various real-world cases. One critical challenge is camera motion blur caused by camera movement during the exposure time, which prevents accurate 3D scene reconstruction. In this study, we propose continuous rigid motion-aware Gaussian splatting (CRiM-GS) to reconstruct an accurate 3D scene from blurry images with real-time rendering speed. Considering that the actual camera motion blurring process consists of complex motion patterns, we predict the continuous movement of the camera based on neural ordinary differential equations (ODEs). Specifically, we leverage rigid body transformations to model the camera motion with proper regularization, preserving the shape and size of the object. Furthermore, we introduce a continuous deformable 3D transformation in the SE(3) field to adapt the rigid body transformation to real-world problems by allowing a higher degree of freedom. By revisiting fundamental camera theory and employing advanced neural network training techniques, we achieve accurate modeling of continuous camera trajectories. We conduct extensive experiments demonstrating state-of-the-art performance, both quantitatively and qualitatively, on benchmark datasets.
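
As a rough illustration of the trajectory model described here, the sketch below integrates a pose ODE over the exposure window by composing small SE(3) increments obtained from a twist (angular and linear velocity). The constant-twist twist_net stands in for a learned network, and all details are our assumptions rather than the CRiM-GS implementation.

```python
# Minimal sketch (our illustration, not the CRiM-GS code): a velocity field
# predicts an se(3) twist at each time, and Euler integration of the resulting
# ODE yields a continuous sequence of rigid camera poses during the exposure.
import numpy as np

def hat(w):
    """Skew-symmetric matrix of a 3-vector."""
    return np.array([[0, -w[2], w[1]], [w[2], 0, -w[0]], [-w[1], w[0], 0]])

def se3_exp(xi):
    """Exponential map from a twist xi = (omega, v) in R^6 to a 4x4 SE(3) matrix."""
    omega, v = xi[:3], xi[3:]
    theta = np.linalg.norm(omega)
    T = np.eye(4)
    if theta < 1e-8:
        T[:3, 3] = v
        return T
    K = hat(omega / theta)
    R = np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * K @ K
    V = (np.eye(3) + (1 - np.cos(theta)) / theta * K
         + (theta - np.sin(theta)) / theta * K @ K)
    T[:3, :3], T[:3, 3] = R, V @ v
    return T

def twist_net(t, params):
    """Stand-in for the learned velocity field: here just a constant twist."""
    return params  # shape (6,): angular velocity then linear velocity

def integrate_trajectory(T0, params, num_steps=8, exposure=1.0):
    """Euler-integrate the pose ODE over the exposure window, composing small
    rigid increments so every intermediate pose stays in SE(3)."""
    dt = exposure / num_steps
    poses, T = [T0.copy()], T0.copy()
    for k in range(num_steps):
        xi = twist_net(k * dt, params) * dt
        T = T @ se3_exp(xi)
        poses.append(T.copy())
    return poses

poses = integrate_trajectory(np.eye(4), np.array([0.0, 0.2, 0.0, 0.1, 0.0, 0.0]))
print(len(poses), poses[-1][:3, 3])  # 9 poses along the blur trajectory
```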


Divide and Conquer: Rethinking the Training Paradigm of Neural Radiance Fields

Rongkai Ma, Leo Lebrat, Rodrigo Santa Cruz, Gil Avraham, Yan Zuo, Clinton Fookes, Olivier Salvado

arXiv.org Artificial Intelligence

Neural radiance fields (NeRFs) have exhibited potential in synthesizing high-fidelity views of 3D scenes, but the standard training paradigm of NeRF presupposes equal importance for each image in the training set. This assumption poses a significant challenge for rendering specific views that present intricate geometries, resulting in suboptimal performance. In this paper, we take a closer look at the implications of the current training paradigm and redesign it to obtain superior rendering quality from NeRFs. Dividing the input views into multiple groups based on their visual similarities and training an individual model on each group enables each model to specialize in specific regions without sacrificing speed or efficiency. Subsequently, the knowledge of these specialized models is aggregated into a single entity via a teacher-student distillation paradigm, enabling spatial efficiency for online rendering. Empirically, we evaluate our novel training framework on two publicly available datasets, namely NeRF Synthetic and Tanks&Temples. Our evaluation demonstrates that our DaC training pipeline enhances the rendering quality of a state-of-the-art baseline model while converging to a superior minimum.
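
A minimal sketch of the grouping step is given below: views are clustered by visual similarity (plain k-means on crude downsampled descriptors standing in for real features), so that one expert model could be trained per group before teacher-student distillation. Function names and the descriptor are illustrative assumptions, not the DaC code.

```python
# Hedged sketch of grouping input views by visual similarity before training
# one expert model per group. The descriptor and clustering are placeholders.
import numpy as np

def image_features(images, grid=8):
    """Crude per-image descriptor: average-pool to a grid x grid image and flatten."""
    feats = []
    for im in images:
        h, w = im.shape[0] - im.shape[0] % grid, im.shape[1] - im.shape[1] % grid
        pooled = im[:h, :w].reshape(grid, h // grid, grid, w // grid, -1).mean(axis=(1, 3))
        feats.append(pooled.ravel())
    return np.stack(feats)

def kmeans_groups(feats, k=3, iters=20, seed=0):
    """Assign each view to one of k groups with Lloyd's algorithm."""
    rng = np.random.default_rng(seed)
    centers = feats[rng.choice(len(feats), size=k, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(feats[:, None, :] - centers[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = feats[labels == j].mean(axis=0)
    return labels

# Toy usage: random images stand in for the training views of a scene.
views = [np.random.default_rng(i).random((64, 64, 3)) for i in range(12)]
labels = kmeans_groups(image_features(views), k=3)
print(labels)  # group index per view; one expert model would be trained per group
```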


AI already turns text prompts into stunning art. Next up: video

PCWorld

Runway has shouldered aside Midjourney and Stable Diffusion, introducing the first clips of text-to-video AI art that the company says are completely generated from a text prompt. The company said that it's offering a waitlist to join what it calls "Gen 2" of its text-to-video AI, after offering a similar waitlist for its first, simpler text-to-video tools that use a real-world scene as a model. When AI art emerged last year, it used a text-to-image model: a user would input a text prompt describing the scene, and the tool would attempt to create an image using what it knew of real-world "seeds," artistic styles, and so forth. Services like Midjourney perform these tasks on a cloud server, while Stable Diffusion and Stable Horde take advantage of similar AI models running on home PCs.